improve extracting PDF text #572
                
Closed
You're already using the best LangChain PDF loader, `pymupdfparser`. However, its default `mode` parameter is "page", which means it extracts the text of each page of the PDF (along with its page metadata), and each page is then split separately by the `recursivecharactertextsplitter`. LangChain's "page" mode uses the `get_text` method from `pymupdf`, which extracts text from a single page of a PDF.

LangChain's other option is "single" mode, which also uses `get_text` from `pymupdf` but then concatenates everything. The huge drawback is that you lose the page metadata, and you can forget about assigning page metadata to each "chunk".

This PR solves that issue, ultimately allowing for accurate "page citations" in a user's application.
It uses custom loader/parser classes that:

- Extract text as in `page` mode, but prepend a unique page marker to the text extracted from each page (e.g. [[page1]], [[page2]], and so on).
- Use a `regex` to search the concatenated text, page markers included, to determine the page on which each "chunk" begins, by looking for the nearest page marker preceding that chunk.

The benefit is that chunks of text are no longer artificially split at the page boundaries of the PDF itself, i.e. chunks can span pages.
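
For readers who want to see the idea in isolation, here is a minimal sketch of the marker approach in plain Python. It is not the code from this PR: the function names, the fixed-size splitter, and the choice to strip markers before splitting (so a marker can never be cut in half) are all assumptions made for illustration. Only the `[[pageN]]` marker format comes from the description above.

```python
import re
from bisect import bisect_right

# Marker format taken from the example above; everything else is hypothetical.
PAGE_MARKER = re.compile(r"\[\[page(\d+)\]\]")


def concat_with_markers(pages):
    """Join per-page text into one string, prefixing each page with its marker."""
    return "".join(f"[[page{n}]]{text}" for n, text in enumerate(pages, start=1))


def strip_markers(marked):
    """Return (clean_text, starts, numbers).

    starts[i] is the offset in clean_text where the page numbered numbers[i] begins.
    """
    parts, starts, numbers = [], [], []
    clean_len = 0
    last_end = 0
    for match in PAGE_MARKER.finditer(marked):
        segment = marked[last_end:match.start()]
        parts.append(segment)
        clean_len += len(segment)
        starts.append(clean_len)
        numbers.append(int(match.group(1)))
        last_end = match.end()
    parts.append(marked[last_end:])
    return "".join(parts), starts, numbers


def chunk_with_pages(pages, chunk_size):
    """Split the concatenated text and tag each chunk with the page it starts on."""
    clean, starts, numbers = strip_markers(concat_with_markers(pages))
    chunks = []
    # Stand-in splitter: fixed-size character windows over the whole document.
    for offset in range(0, len(clean), chunk_size):
        # Nearest marker at or before the chunk's first character.
        idx = bisect_right(starts, offset) - 1
        chunks.append({"page": numbers[idx], "text": clean[offset:offset + chunk_size]})
    return chunks


pages = ["alpha beta ", "gamma ", "delta epsilon zeta"]
for chunk in chunk_with_pages(pages, chunk_size=12):
    print(chunk)
```

In this sketch the first chunk starts on page 1 but runs into page 2, which is the "chunks can span pages" behaviour described above. A real splitter that breaks on separators would replace the fixed-size window.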
The only "parser" from LangChain that does this "out of the box" is `pdfminer`, but it's insanely slow. Therefore, I highly recommend using this custom `pymupdf` approach instead.

Moreover, it more accurately respects the `chunk_size` parameter. For example, in "page" mode, if a particular page of a PDF has only 200 characters, you'll get a chunk of 200 characters even if you set `chunk_size` to 1,000,000. The custom approach allows chunks to extend across pages, obviating this problem.

Many embedding models can now handle chunk sizes well above the standard 512 characters, and there are use cases for it. Overall, though, it's just better to have accurate page metadata for each "chunk" after processing the entire concatenated text.
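
The `chunk_size` point can be illustrated with a toy comparison. This is only a sketch: `split_fixed` is a hypothetical fixed-size splitter standing in for a real one, and the three 200-character "pages" are made-up data.

```python
def split_fixed(text, chunk_size):
    """Stand-in for a real text splitter: fixed-size character windows."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_per_page(pages, chunk_size):
    """Page-by-page splitting: no chunk can ever cross a page boundary."""
    return [chunk for page in pages for chunk in split_fixed(page, chunk_size)]


def split_concatenated(pages, chunk_size):
    """Splitting the concatenated text: chunks may span pages."""
    return split_fixed("".join(pages), chunk_size)


# Three hypothetical pages of 200 characters each.
pages = ["a" * 200, "b" * 200, "c" * 200]

per_page = split_per_page(pages, chunk_size=500)
joined = split_concatenated(pages, chunk_size=500)

print([len(c) for c in per_page])  # [200, 200, 200]
print([len(c) for c in joined])    # [500, 100]
```

Splitting page by page caps every chunk at the page length, whatever `chunk_size` says, while splitting the concatenated text fills each chunk up to the requested size.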